The data relate to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
The data are available at the UCI Machine Learning Repository.
Citation Request: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
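For reproducibility, the file can also be fetched straight from the repository rather than from a local path. This is a sketch: the URL and zip layout are assumed from the UCI dataset page, and `bank-full.csv` inside the archive is semicolon-delimited.

```r
# Sketch: download the bank marketing data from the UCI repository.
# URL assumed from the UCI page; bank-full.csv uses ";" as field separator.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"
tmp <- tempfile(fileext = ".zip")
download.file(url, tmp, mode = "wb")
unzip(tmp, files = "bank-full.csv", exdir = tempdir())
Data <- read.csv(file.path(tempdir(), "bank-full.csv"),
                 sep = ";", stringsAsFactors = FALSE)
dim(Data)   # expect 45211 rows, 17 columns
```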
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed.
Number of Instances: 45211 for bank-full.csv (4521 for bank.csv)
Number of Attributes: 16 + output attribute.
Attribute information:
For more information, read [Moro et al., 2011].
Input variables:

Bank client data:

1 - age (numeric)
2 - job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
3 - marital: marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed)
4 - education (categorical: "unknown", "secondary", "primary", "tertiary")
5 - default: has credit in default? (binary: "yes", "no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes", "no")
8 - loan: has personal loan? (binary: "yes", "no")

Related with the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown", "telephone", "cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)

Other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")
Output variable (desired target):

17 - y: has the client subscribed a term deposit? (binary: "yes", "no")

Missing Attribute Values: None
pacman::p_load("lubridate", "dplyr", "magrittr")
library(rio)
library(doParallel)
library(viridis)
library(RColorBrewer)
library(tidyverse)
library(ggthemes)
library(knitr)
library(caret)
library(caretEnsemble)
library(plotly)
library(lime)
library(plotROC)
# Calculate the number of cores
no_cores <- detectCores() - 1
cl<-makeCluster(no_cores)
registerDoParallel(cl)
setwd("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience")
#load excel file with rio
Data<- rio::import("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/Anomalydetection/bank/bank-full.csv")
Data%>%head
#%>%kable()
summary(Data)
age job marital education default
Min. :18.00 Length:45211 Length:45211 Length:45211 Length:45211
1st Qu.:33.00 Class :character Class :character Class :character Class :character
Median :39.00 Mode :character Mode :character Mode :character Mode :character
Mean :40.94
3rd Qu.:48.00
Max. :95.00
balance housing loan contact day
Min. : -8019 Length:45211 Length:45211 Length:45211 Min. : 1.00
1st Qu.: 72 Class :character Class :character Class :character 1st Qu.: 8.00
Median : 448 Mode :character Mode :character Mode :character Median :16.00
Mean : 1362 Mean :15.81
3rd Qu.: 1428 3rd Qu.:21.00
Max. :102127 Max. :31.00
month duration campaign pdays previous
Length:45211 Min. : 0.0 Min. : 1.000 Min. : -1.0 Min. : 0.0000
Class :character 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000
Mode :character Median : 180.0 Median : 2.000 Median : -1.0 Median : 0.0000
Mean : 258.2 Mean : 2.764 Mean : 40.2 Mean : 0.5803
3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
Max. :4918.0 Max. :63.000 Max. :871.0 Max. :275.0000
poutcome y
Length:45211 Length:45211
Class :character Class :character
Mode :character Mode :character
There are no missing observations in the data.
#==================================================================
#check the number of missing rows
#==================================================================
colSums(is.na.data.frame(Data))
age job marital education default balance housing loan contact day
0 0 0 0 0 0 0 0 0 0
month duration campaign pdays previous poutcome y
0 0 0 0 0 0 0
#==================================================================
# descriptive/summary statistics
#==================================================================
Hmisc::describe.data.frame(Data)
Data
17 Variables 45211 Observations
--------------------------------------------------------------------------------------------------------
age
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 77 0.999 40.94 11.87 27 29 33 39 48
.90 .95
56 59
lowest : 18 19 20 21 22, highest: 90 92 93 94 95
--------------------------------------------------------------------------------------------------------
job
n missing distinct
45211 0 12
Value admin. blue-collar entrepreneur housemaid management retired
Frequency 5171 9732 1487 1240 9458 2264
Proportion 0.114 0.215 0.033 0.027 0.209 0.050
Value self-employed services student technician unemployed unknown
Frequency 1579 4154 938 7597 1303 288
Proportion 0.035 0.092 0.021 0.168 0.029 0.006
--------------------------------------------------------------------------------------------------------
marital
n missing distinct
45211 0 3
Value divorced married single
Frequency 5207 27214 12790
Proportion 0.115 0.602 0.283
--------------------------------------------------------------------------------------------------------
education
n missing distinct
45211 0 4
Value primary secondary tertiary unknown
Frequency 6851 23202 13301 1857
Proportion 0.152 0.513 0.294 0.041
--------------------------------------------------------------------------------------------------------
default
n missing distinct
45211 0 2
Value no yes
Frequency 44396 815
Proportion 0.982 0.018
--------------------------------------------------------------------------------------------------------
balance
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 7168 1 1362 2054 -172 0 72 448 1428
.90 .95
3574 5768
lowest : -8019 -6847 -4057 -3372 -3313, highest: 66721 71188 81204 98417 102127
--------------------------------------------------------------------------------------------------------
housing
n missing distinct
45211 0 2
Value no yes
Frequency 20081 25130
Proportion 0.444 0.556
--------------------------------------------------------------------------------------------------------
loan
n missing distinct
45211 0 2
Value no yes
Frequency 37967 7244
Proportion 0.84 0.16
--------------------------------------------------------------------------------------------------------
contact
n missing distinct
45211 0 3
Value cellular telephone unknown
Frequency 29285 2906 13020
Proportion 0.648 0.064 0.288
--------------------------------------------------------------------------------------------------------
day
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 31 0.999 15.81 9.576 3 5 8 16 21
.90 .95
28 29
lowest : 1 2 3 4 5, highest: 27 28 29 30 31
--------------------------------------------------------------------------------------------------------
month
n missing distinct
45211 0 12
Value apr aug dec feb jan jul jun mar may nov oct sep
Frequency 2932 6247 214 2649 1403 6895 5341 477 13766 3970 738 579
Proportion 0.065 0.138 0.005 0.059 0.031 0.153 0.118 0.011 0.304 0.088 0.016 0.013
--------------------------------------------------------------------------------------------------------
duration
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 1573 1 258.2 235.4 35 58 103 180 319
.90 .95
548 751
lowest : 0 1 2 3 4, highest: 3366 3422 3785 3881 4918
--------------------------------------------------------------------------------------------------------
campaign
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 48 0.918 2.764 2.383 1 1 1 2 3
.90 .95
5 8
lowest : 1 2 3 4 5, highest: 50 51 55 58 63
--------------------------------------------------------------------------------------------------------
pdays
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 559 0.454 40.2 71.61 -1 -1 -1 -1 -1
.90 .95
185 317
lowest : -1 1 2 3 4, highest: 838 842 850 854 871
--------------------------------------------------------------------------------------------------------
previous
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 41 0.454 0.5803 1.044 0 0 0 0 0
.90 .95
2 3
lowest : 0 1 2 3 4, highest: 41 51 55 58 275
--------------------------------------------------------------------------------------------------------
poutcome
n missing distinct
45211 0 4
Value failure other success unknown
Frequency 4901 1840 1511 36959
Proportion 0.108 0.041 0.033 0.817
--------------------------------------------------------------------------------------------------------
y
n missing distinct
45211 0 2
Value no yes
Frequency 39922 5289
Proportion 0.883 0.117
--------------------------------------------------------------------------------------------------------
#describe(Data)
#==================================================================
# Histograms
#==================================================================
theme_set(theme_economist_white())
#ggplot(Data) + geom_boxplot(aes(x =age,y=duration,color=y))
ggplot(Data, aes(x ="",y=age, fill=y))+ geom_boxplot()+labs(x="age",y="")
#ggplotly(p)
ggplot(Data, aes(x =duration, fill=y))+ geom_histogram(bins = 30)
#ggplotly()
ggplot(Data, aes(x =age, fill=y))+ geom_histogram(bins = 30)
# ggplotly()
ggplot(Data, aes(x =day, fill=y))+ geom_histogram(bins = 30)
#ggplotly()
ggplot(Data, aes(x =balance, fill=y))+ geom_histogram(bins = 30)
#ggplotly()
ggplot(Data, aes(x =age, fill=y))+ geom_histogram(bins = 30)
# ggplotly()
# geom_density(alpha=1/3,color="red") + scale_fill_hue()
ggplot(Data, aes(x=age, fill=y)) + geom_density(alpha=1/3) + scale_fill_hue()
ggplotly()
Convert character variables to factor variables. This is necessary for the caret package to train the models we are interested in later.
Data<-Data %>% mutate_if(is.character, as.factor)
str(Data)
'data.frame': 45211 obs. of 17 variables:
$ age : int 58 44 33 47 33 35 28 42 58 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
$ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
$ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
$ day : int 5 5 5 5 5 5 5 5 5 5 ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
$ duration : int 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
ggplot(Data, aes(x = y)) + geom_bar() + labs(x = "Term Deposit")
#==================================================================
#Converting outcome variable to numeric
#==================================================================
Data$y<-ifelse(Data$y=='no',0,1)
str(Data)
'data.frame': 45211 obs. of 17 variables:
$ age : int 58 44 33 47 33 35 28 42 58 43 ...
$ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
$ marital : chr "married" "single" "married" "married" ...
$ education: chr "tertiary" "secondary" "secondary" "unknown" ...
$ default : chr "no" "no" "no" "no" ...
$ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
$ housing : chr "yes" "yes" "yes" "yes" ...
$ loan : chr "no" "no" "yes" "no" ...
$ contact : chr "unknown" "unknown" "unknown" "unknown" ...
$ day : int 5 5 5 5 5 5 5 5 5 5 ...
$ month : chr "may" "may" "may" "may" ...
$ duration : int 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
$ y : num 0 0 0 0 0 0 0 0 0 0 ...
glimpse(Data)
Observations: 45,211
Variables: 17
$ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, 51, 45, 57, 60, 33, 2...
$ job <chr> "management", "technician", "entrepreneur", "blue-collar", "unknown", "management...
$ marital <chr> "married", "single", "married", "married", "single", "married", "single", "divorc...
$ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", "tertiary", "tertiary...
$ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no", "no", "no", "no", "n...
$ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71, 162, 229, 13, 52, 6...
$ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes"...
$ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no", "...
$ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unknown", "unknown", "unk...
$ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
$ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may", "may", "may", "may...
$ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517, 71, 174, 353, 98, 38...
$ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -...
$ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unknown", "unknown", "unk...
$ y <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Convert categorical variables to dummy variables using either model.matrix or Matrix::sparse.model.matrix.
predictors<-setdiff(names(Data),"y")
#names(Data)
predictors<-names(Data[,-17])
#paste0(predictors,sep="",collapse = "+ ")
#paste("~",paste0(predictors,sep="",collapse = "+ "))
as.formula(paste("~",paste0(predictors,sep="",collapse = "+ ")))
~age + job + marital + education + default + balance + housing +
loan + contact + day + month + duration + campaign + pdays +
previous + poutcome
d1<-model.matrix(as.formula(paste("~",paste0(predictors,sep="",collapse = "+ "))), Data)
d1b<-Matrix::sparse.model.matrix(as.formula(paste("~",paste0(predictors,sep="",collapse = "+ "))), Data)
#d1<-model.matrix(~age+job+marital+education + default + balance + housing +
# loan + contact + day + month + duration + campaign + pdays +
# previous + poutcome,Data)
#d1b<-Matrix::sparse.model.matrix(~age+job+marital+education + default + balance + housing +
# loan + contact + day + month + duration + campaign + pdays +
# previous + poutcome,Data)
head(d1)
(Intercept) age jobblue-collar jobentrepreneur jobhousemaid jobmanagement jobretired jobself-employed
1 1 58 0 0 0 1 0 0
2 1 44 0 0 0 0 0 0
3 1 33 0 1 0 0 0 0
4 1 47 1 0 0 0 0 0
5 1 33 0 0 0 0 0 0
6 1 35 0 0 0 1 0 0
jobservices jobstudent jobtechnician jobunemployed jobunknown maritalmarried maritalsingle
1 0 0 0 0 0 1 0
2 0 0 1 0 0 0 1
3 0 0 0 0 0 1 0
4 0 0 0 0 0 1 0
5 0 0 0 0 1 0 1
6 0 0 0 0 0 1 0
educationsecondary educationtertiary educationunknown defaultyes balance housingyes loanyes
1 0 1 0 0 2143 1 0
2 1 0 0 0 29 1 0
3 1 0 0 0 2 1 1
4 0 0 1 0 1506 1 0
5 0 0 1 0 1 0 0
6 0 1 0 0 231 1 0
contacttelephone contactunknown day monthaug monthdec monthfeb monthjan monthjul monthjun monthmar
1 0 1 5 0 0 0 0 0 0 0
2 0 1 5 0 0 0 0 0 0 0
3 0 1 5 0 0 0 0 0 0 0
4 0 1 5 0 0 0 0 0 0 0
5 0 1 5 0 0 0 0 0 0 0
6 0 1 5 0 0 0 0 0 0 0
monthmay monthnov monthoct monthsep duration campaign pdays previous poutcomeother poutcomesuccess
1 1 0 0 0 261 1 -1 0 0 0
2 1 0 0 0 151 1 -1 0 0 0
3 1 0 0 0 76 1 -1 0 0 0
4 1 0 0 0 92 1 -1 0 0 0
5 1 0 0 0 198 1 -1 0 0 0
6 1 0 0 0 139 1 -1 0 0 0
poutcomeunknown
1 1
2 1
3 1
4 1
5 1
6 1
head(d1b)
[1] 1 1 1 1 1 1
The dummy conversion results in 42 variables.
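As a minimal illustration of where that count comes from: model.matrix encodes a k-level factor as k-1 dummy columns, dropping the (alphabetically first) reference level, which is why the 16 inputs expand to 42 predictor columns.

```r
# Toy sketch: a 3-level factor yields 2 dummy columns plus the intercept;
# the alphabetically first level ("divorced") becomes the reference.
df <- data.frame(marital = factor(c("married", "single", "divorced")))
m  <- model.matrix(~ marital, df)
colnames(m)   # "(Intercept)" "maritalmarried" "maritalsingle"
```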
#==================================================================
#convert categorical variables to numeric variables
#==================================================================
dmy <- dummyVars(" ~ .", data = Data,fullRank = T)
transformed <- data.frame(predict(dmy, newdata =Data))
#Checking the structure of transformed train file
str(transformed)
'data.frame': 45211 obs. of 43 variables:
$ age : num 58 44 33 47 33 35 28 42 58 43 ...
$ jobblue.collar : num 0 0 0 1 0 0 0 0 0 0 ...
$ jobentrepreneur : num 0 0 1 0 0 0 0 1 0 0 ...
$ jobhousemaid : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobmanagement : num 1 0 0 0 0 1 1 0 0 0 ...
$ jobretired : num 0 0 0 0 0 0 0 0 1 0 ...
$ jobself.employed : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobservices : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobstudent : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobtechnician : num 0 1 0 0 0 0 0 0 0 1 ...
$ jobunemployed : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobunknown : num 0 0 0 0 1 0 0 0 0 0 ...
$ maritalmarried : num 1 0 1 1 0 1 0 0 1 0 ...
$ maritalsingle : num 0 1 0 0 1 0 1 0 0 1 ...
$ educationsecondary: num 0 1 1 0 0 0 0 0 0 1 ...
$ educationtertiary : num 1 0 0 0 0 1 1 1 0 0 ...
$ educationunknown : num 0 0 0 1 1 0 0 0 0 0 ...
$ defaultyes : num 0 0 0 0 0 0 0 1 0 0 ...
$ balance : num 2143 29 2 1506 1 ...
$ housingyes : num 1 1 1 1 0 1 1 1 1 1 ...
$ loanyes : num 0 0 1 0 0 0 1 0 0 0 ...
$ contacttelephone : num 0 0 0 0 0 0 0 0 0 0 ...
$ contactunknown : num 1 1 1 1 1 1 1 1 1 1 ...
$ day : num 5 5 5 5 5 5 5 5 5 5 ...
$ monthaug : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthdec : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthfeb : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthjan : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthjul : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthjun : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthmar : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthmay : num 1 1 1 1 1 1 1 1 1 1 ...
$ monthnov : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthoct : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthsep : num 0 0 0 0 0 0 0 0 0 0 ...
$ duration : num 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcomeother : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcomesuccess : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcomeunknown : num 1 1 1 1 1 1 1 1 1 1 ...
$ y : num 0 0 0 0 0 0 0 0 0 0 ...
#==================================================================
#Converting the dependent variable back to categorical
#==================================================================
transformed$y<-as.factor(transformed$y)
#==================================================================
#Splitting the data into training and test sets based on outcome: 70% and 30%
#==================================================================
index<-createDataPartition(transformed$y,p=0.70, list=FALSE)
trainSet<-transformed[index,]
testSet<-transformed[-index,]
outcomeName<-'y'
predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
predictors
[1] "age" "jobblue.collar" "jobentrepreneur" "jobhousemaid"
[5] "jobmanagement" "jobretired" "jobself.employed" "jobservices"
[9] "jobstudent" "jobtechnician" "jobunemployed" "jobunknown"
[13] "maritalmarried" "maritalsingle" "educationsecondary" "educationtertiary"
[17] "educationunknown" "defaultyes" "balance" "housingyes"
[21] "loanyes" "contacttelephone" "contactunknown" "day"
[25] "monthaug" "monthdec" "monthfeb" "monthjan"
[29] "monthjul" "monthjun" "monthmar" "monthmay"
[33] "monthnov" "monthoct" "monthsep" "duration"
[37] "campaign" "pdays" "previous" "poutcomeother"
[41] "poutcomesuccess" "poutcomeunknown"
#==================================================================
#Feature selection using rfe in caret (recursive feature elimination)
#predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
#Alternatively
#predictors<-setdiff(names(trainSet),outcomeName)
#==================================================================
library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:Hmisc’:
combine
The following object is masked from ‘package:ggplot2’:
margin
The following object is masked from ‘package:dplyr’:
combine
control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 3,
verbose = FALSE)
outcomeName<-'y'
predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
feature_select <- rfe(trainSet[,predictors], trainSet[,outcomeName],
rfeControl = control)
#save(feature_select,file="feature_select.RData")
load("feature_select.RData")
#The top 5 variables (out of 42):
#print("The top 5 variables (out of 42)\n")
cat("The top 5 variables (out of 42)\n")
The top 5 variables (out of 42)
cat("duration, poutcomesuccess, monthmar, contactunknown, housingyes\n")
duration, poutcomesuccess, monthmar, contactunknown, housingyes
predictors(feature_select)
[1] "duration" "poutcomesuccess" "monthmar" "contactunknown"
[5] "housingyes" "monthoct" "age" "day"
[9] "monthsep" "pdays" "monthjul" "monthdec"
[13] "monthaug" "monthmay" "monthfeb" "monthjun"
[17] "previous" "monthnov" "poutcomeunknown" "monthjan"
[21] "campaign" "loanyes" "maritalmarried" "educationtertiary"
[25] "maritalsingle" "balance" "jobblue.collar" "contacttelephone"
[29] "jobmanagement" "jobstudent" "jobservices" "poutcomeother"
[33] "defaultyes" "jobretired" "educationsecondary" "educationunknown"
[37] "jobtechnician" "jobhousemaid" "jobunknown" "jobentrepreneur"
[41] "jobself.employed" "jobunemployed"
#===================================================================================
# plot variable selection
#===================================================================================
trellis.par.set(caretTheme())
plot(feature_select, type = c( "o","g"))
The plot shows that a subset of roughly 8 to 14 features already provides the optimal accuracy for training.
The top 5 variables alone provide an accuracy of about 90% on this data; the remaining 37 variables add less than 0.1%. This is the advantage of feature selection: it reduces model complexity, guards against overfitting, and saves computation time.
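The mechanics of reading these numbers off an rfe object can be shown with a small self-contained run on iris (not the bank data): caret reports the resampled accuracy per subset size and the predictors it selected.

```r
# Self-contained toy rfe run, just to show how the results are inspected.
library(caret)
library(randomForest)
set.seed(42)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 3)
fit  <- rfe(iris[, 1:4], iris$Species, sizes = 1:3, rfeControl = ctrl)
fit$results[, c("Variables", "Accuracy")]   # accuracy for each subset size
fit$optsize                                 # subset size caret judged optimal
predictors(fit)                             # names of the selected features
```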
#===================================================================================
#Taking only the top 5 predictors from feature_select:
#duration, poutcomesuccess, monthmar, contactunknown, housingyes
# Cs function from Hmisc converts unquoted names to character strings
#===================================================================================
library(Hmisc)
p=c(paste0(predictors(feature_select),sep=",",collapse = ""))
#trainSet[,]
predictor=Hmisc::Cs(duration,poutcomesuccess,monthmar,contactunknown,age,housingyes,day,monthoct,pdays,monthsep,monthjul,monthdec,monthmay,monthaug,campaign,monthjun)
cat("The set of features selected is:\n")
The set of features selected is:
p
[1] "duration,poutcomesuccess,monthmar,contactunknown,housingyes,monthoct,age,day,monthsep,pdays,monthjul,monthdec,monthaug,monthmay,monthfeb,monthjun,previous,monthnov,poutcomeunknown,monthjan,campaign,loanyes,maritalmarried,educationtertiary,maritalsingle,balance,jobblue.collar,contacttelephone,jobmanagement,jobstudent,jobservices,poutcomeother,defaultyes,jobretired,educationsecondary,educationunknown,jobtechnician,jobhousemaid,jobunknown,jobentrepreneur,jobself.employed,jobunemployed,"
trainSet[,outcomeName]%>%head()
[1] 0 0 0 0 0 0
Levels: 0 1
class(trainSet[,outcomeName])
[1] "factor"
Topfivepred = Hmisc::Cs(duration, poutcomesuccess, monthmar, contactunknown, housingyes)
trainSet[,Topfivepred]%>%head()
Logistic Regression Model
model_glm<-train(trainSet[,Topfivepred],as.factor(trainSet$y),method='glm',family="binomial")
summary(model_glm)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-5.6320 -0.4069 -0.2862 -0.1763 3.1529
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.989e+00 3.847e-02 -77.69 <2e-16 ***
duration 3.994e-03 7.338e-05 54.42 <2e-16 ***
poutcomesuccess 2.617e+00 7.374e-02 35.49 <2e-16 ***
monthmar 2.175e+00 1.258e-01 17.29 <2e-16 ***
contactunknown -1.301e+00 6.729e-02 -19.34 <2e-16 ***
housingyes -7.931e-01 4.373e-02 -18.14 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 22845 on 31648 degrees of freedom
Residual deviance: 16233 on 31643 degrees of freedom
AIC: 16245
Number of Fisher Scoring iterations: 6
The accuracy of the logistic regression model is about 90.4%.
#save(model_glm,file="model_glm.RData")
load("model_glm.RData")
# Predict using the test data
pred<-predict(model_glm,testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('logistic model')
# Print, plot variable importance
print(varImp(model_glm, scale = FALSE))
glm variable importance
Overall
duration 54.92
poutcomesuccess 36.58
contactunknown 18.68
housingyes 17.54
monthmar 16.91
plot(varImp(model_glm, scale = FALSE), main="Variable Importance using logistic/glm")
confusionMatrix(testSet$y,pred)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 11704 272
1 1063 523
Accuracy : 0.9016
95% CI : (0.8964, 0.9065)
No Information Rate : 0.9414
P-Value [Acc > NIR] : 1
Kappa : 0.3918
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9167
Specificity : 0.6579
Pos Pred Value : 0.9773
Neg Pred Value : 0.3298
Prevalence : 0.9414
Detection Rate : 0.8630
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7873
'Positive' Class : 0
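The headline metrics can be recomputed by hand from the matrix above (with class "0" treated as the positive class, as caret notes):

```r
# Rebuild the confusion matrix printed above and recover the metrics.
cm <- matrix(c(11704, 1063, 272, 523), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))
accuracy    <- sum(diag(cm)) / sum(cm)        # (11704 + 523) / 13562
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # recall on positive class "0"
specificity <- cm["1", "1"] / sum(cm[, "1"])
round(c(accuracy, sensitivity, specificity), 4)  # 0.9016 0.9167 0.6579
```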
#==================================================================
#ROCR Curve
#==================================================================
library(pROC)
#install.packages("pROC")
# Compute AUC for predicting y with the single variable duration
f1 = roc(y ~ duration, data=trainSet)
plot(f1, col="red")
p=ggplot(trainSet, aes(d = y, m = duration)) + geom_roc()+ style_roc()
plot_interactive_roc(p)
#Draw the ROC curve
glm.probs <- predict(model_glm,testSet,type="prob")
head(glm.probs)
# the probability columns are named after the class levels "0" and "1",
# so take the second column (probability of class "1")
glm.ROC <- roc(predictor=glm.probs[,2],
response=testSet$y,
levels=rev(levels(testSet$y)))
#===================================================================================
# multiple algorithms
#===================================================================================
fitControl <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 5)
# classProbs=TRUE is omitted: the class levels "0"/"1" are not valid R
# variable names, so caret would refuse to compute class probabilities.
met=c("LogitBoost", 'xgbTree', 'rf', 'svmRadial')
re=list()
for (i in seq_along(met)) {
# pass the control object to train (it was defined but unused before)
re[[i]]<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method=met[i],
trControl=fitControl,
preProcess = c("center","scale"))
}
Save the models to avoid re-running them every time.
save(re,file="multiplealgorithms.RData")
Boosted Logistic Regression
re[[1]]%>%head()
$method
[1] "LogitBoost"
$modelInfo
$modelInfo$label
[1] "Boosted Logistic Regression"
$modelInfo$library
[1] "caTools"
$modelInfo$loop
function (grid)
{
loop <- grid[which.max(grid$nIter), , drop = FALSE]
submodels <- grid[-which.max(grid$nIter), , drop = FALSE]
submodels <- list(submodels)
list(loop = loop, submodels = submodels)
}
$modelInfo$type
[1] "Classification"
$modelInfo$parameters
$modelInfo$grid
function (x, y, len = NULL, search = "grid")
{
if (search == "grid") {
out <- data.frame(nIter = 1 + ((1:len) * 10))
}
else {
out <- data.frame(nIter = unique(sample(1:100, size = len,
replace = TRUE)))
}
out
}
$modelInfo$fit
function (x, y, wts, param, lev, last, classProbs, ...)
{
caTools::LogitBoost(as.matrix(x), y, nIter = param$nIter)
}
$modelInfo$predict
function (modelFit, newdata, submodels = NULL)
{
out <- caTools::predict.LogitBoost(modelFit, newdata, type = "class")
if (!is.null(submodels)) {
tmp <- out
out <- vector(mode = "list", length = nrow(submodels) +
1)
out[[1]] <- tmp
for (j in seq(along = submodels$nIter)) {
out[[j + 1]] <- caTools::predict.LogitBoost(modelFit,
newdata, nIter = submodels$nIter[j])
}
}
out
}
$modelInfo$prob
function (modelFit, newdata, submodels = NULL)
{
out <- caTools::predict.LogitBoost(modelFit, newdata, type = "raw")
out <- t(apply(out, 1, function(x) x/sum(x)))
if (!is.null(submodels)) {
tmp <- vector(mode = "list", length = nrow(submodels) +
1)
tmp[[1]] <- out
for (j in seq(along = submodels$nIter)) {
tmpProb <- caTools::predict.LogitBoost(modelFit,
newdata, type = "raw", nIter = submodels$nIter[j])
tmpProb <- out <- t(apply(tmpProb, 1, function(x) x/sum(x)))
tmp[[j + 1]] <- as.data.frame(tmpProb[, modelFit$obsLevels,
drop = FALSE])
}
out <- tmp
}
out
}
$modelInfo$predictors
function (x, ...)
{
if (!is.null(x$xNames)) {
out <- unique(x$xNames[x$Stump[, "feature"]])
}
else out <- NA
out
}
$modelInfo$levels
function (x)
x$obsLevels
$modelInfo$tags
[1] "Ensemble Model" "Boosting" "Implicit Feature Selection"
[4] "Tree-Based Model" "Logistic Regression"
$modelInfo$sort
function (x)
x[order(x[, 1]), ]
$modelType
[1] "Classification"
$results
$pred
NULL
$bestTune
re[[1]][[4]]%>%kable()
| nIter | Accuracy | Kappa | AccuracySD | KappaSD |
|---|---|---|---|---|
| 11 | 0.8954207 | 0.3578465 | 0.0039172 | 0.0544693 |
| 21 | 0.8935720 | 0.3555484 | 0.0041170 | 0.0617421 |
| 31 | 0.8940555 | 0.3667228 | 0.0052416 | 0.0519837 |
plot(re[[1]])
varImp(object=re[[1]])
ROC curve variable importance
Importance
duration 100.00
housingyes 31.53
contactunknown 30.61
poutcomesuccess 22.39
monthmar 0.00
# Predict using the test data
pred<-predict(re[[1]],testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Boosted Logistic Regression')
# Print, plot variable importance
print(varImp(re[[1]], scale = FALSE))
ROC curve variable importance
Importance
duration 0.8045
housingyes 0.6103
contactunknown 0.6077
poutcomesuccess 0.5844
monthmar 0.5209
plot(varImp(re[[1]], scale = FALSE), main="Boosted Logistic Regression")
confusionMatrix(testSet$y,pred)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 11370 606
1 1585 1
Accuracy : 0.8384
95% CI : (0.8321, 0.8446)
No Information Rate : 0.9552
P-Value [Acc > NIR] : 1
Kappa : -0.0682
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8776534
Specificity : 0.0016474
Pos Pred Value : 0.9493988
Neg Pred Value : 0.0006305
Prevalence : 0.9552426
Detection Rate : 0.8383719
Detection Prevalence : 0.8830556
Balanced Accuracy : 0.4396504
'Positive' Class : 0
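With "0" (the majority class, non-subscribers) taken as the positive class, the headline accuracy hides how badly the minority class is captured: specificity is almost zero and Kappa is negative. The reported metrics can be reconstructed directly from the printed counts (a base-R sketch, using the counts copied from the matrix above):

```r
# Confusion-matrix counts from the LogitBoost output above
# (rows = Prediction, columns = Reference; positive class is "0").
cm <- matrix(c(11370, 1585, 606, 1), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # 0.8777
specificity <- cm["1", "1"] / sum(cm[, "1"])    # 0.0016
balanced    <- (sensitivity + specificity) / 2  # 0.4397
accuracy    <- sum(diag(cm)) / sum(cm)          # 0.8384
```

A balanced accuracy below 0.5 means the model does worse than chance on the minority class, which is why Kappa comes out negative despite 84% accuracy.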
eXtreme Gradient Boosting
#re[[2]]
re[[2]][[4]]%>%head()%>%kable()
| | eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | nrounds | Accuracy | Kappa | AccuracySD | KappaSD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.3 | 1 | 0 | 0.6 | 1 | 0.50 | 50 | 0.8999284 | 0.3848753 | 0.0023361 | 0.0209991 |
| 4 | 0.3 | 1 | 0 | 0.6 | 1 | 0.75 | 50 | 0.8999285 | 0.3836727 | 0.0023861 | 0.0257059 |
| 7 | 0.3 | 1 | 0 | 0.6 | 1 | 1.00 | 50 | 0.8998351 | 0.3733176 | 0.0024587 | 0.0317696 |
| 10 | 0.3 | 1 | 0 | 0.8 | 1 | 0.50 | 50 | 0.8998809 | 0.3832630 | 0.0026702 | 0.0178440 |
| 13 | 0.3 | 1 | 0 | 0.8 | 1 | 0.75 | 50 | 0.9000034 | 0.3795111 | 0.0028604 | 0.0194459 |
| 16 | 0.3 | 1 | 0 | 0.8 | 1 | 1.00 | 50 | 0.8999448 | 0.3766056 | 0.0029003 | 0.0150799 |
plot(re[[2]])
varImp(object=re[[2]])
xgbTree variable importance
Overall
duration 100.000
poutcomesuccess 34.165
contactunknown 22.346
monthmar 7.693
housingyes 0.000
# Predict using the test data
pred<-predict(re[[2]],testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('eXtreme Gradient Boosting')
# Print, plot variable importance
print(varImp(re[[2]], scale = FALSE))
xgbTree variable importance
Overall
duration 0.60900
poutcomesuccess 0.20806
contactunknown 0.13609
monthmar 0.04685
housingyes 0.00000
plot(varImp(re[[2]], scale = FALSE), main="eXtreme Gradient Boosting")
confusionMatrix(testSet$y,pred)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 11956 20
1 1574 12
Accuracy : 0.8825
95% CI : (0.8769, 0.8878)
No Information Rate : 0.9976
P-Value [Acc > NIR] : 1
Kappa : 0.0103
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.883666
Specificity : 0.375000
Pos Pred Value : 0.998330
Neg Pred Value : 0.007566
Prevalence : 0.997640
Detection Rate : 0.881581
Detection Prevalence : 0.883056
Balanced Accuracy : 0.629333
'Positive' Class : 0
Random Forest
re[[3]]
Random Forest
31649 samples
5 predictor
2 classes: '0', '1'
Pre-processing: centered (5), scaled (5)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 31649, 31649, 31649, 31649, 31649, 31649, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9004525 0.3781019
3 0.9006479 0.3986243
5 0.8831291 0.3440634
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 3.
re[[3]][[4]]%>%kable()
| mtry | Accuracy | Kappa | AccuracySD | KappaSD |
|---|---|---|---|---|
| 2 | 0.9004525 | 0.3781019 | 0.0019847 | 0.0127565 |
| 3 | 0.9006479 | 0.3986243 | 0.0023334 | 0.0132605 |
| 5 | 0.8831291 | 0.3440634 | 0.0023667 | 0.0104566 |
plot(re[[3]])
varImp(object=re[[3]])
rf variable importance
Overall
duration 100.000
poutcomesuccess 37.522
housingyes 1.609
contactunknown 1.115
monthmar 0.000
# Predict using the test data
pred<-predict(re[[3]],testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Random Forest')
# Print, plot variable importance
print(varImp(re[[3]], scale = FALSE))
rf variable importance
Overall
duration 1331.47
poutcomesuccess 561.86
housingyes 119.47
contactunknown 113.39
monthmar 99.65
plot(varImp(re[[3]], scale = FALSE), main="Random Forest")
confusionMatrix(testSet$y,pred)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 11682 294
1 1000 586
Accuracy : 0.9046
95% CI : (0.8995, 0.9095)
No Information Rate : 0.9351
P-Value [Acc > NIR] : 1
Kappa : 0.4275
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9211
Specificity : 0.6659
Pos Pred Value : 0.9755
Neg Pred Value : 0.3695
Prevalence : 0.9351
Detection Rate : 0.8614
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7935
'Positive' Class : 0
Support Vector Machines with Radial Basis Function Kernel
fitControl <- trainControl(method = "repeatedcv",number = 5,repeats = 5,
allowParallel = TRUE )
model_svm<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method='svmRadial',
trControl=fitControl)
#predictions<-predict.train(object=re[[4]],testSet[,predictors],type="raw")
#predictions
model_svm
Support Vector Machines with Radial Basis Function Kernel
31649 samples
5 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 25319, 25319, 25319, 25320, 25319, 25318, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.9014246 0.4084901
0.50 0.9011237 0.4078208
1.00 0.9011023 0.4077365
Tuning parameter 'sigma' was held constant at a value of 3.884677
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 3.884677 and C = 0.25.
save(model_svm,file="model_svm.RData")
load("model_svm.RData")
plot(model_svm)
varImp(object=model_svm)
ROC curve variable importance
Importance
duration 100.00
housingyes 31.41
contactunknown 30.32
poutcomesuccess 23.13
monthmar 0.00
# Predict using the test data
#pred<-predict(model_svm,testSet)
predictions<-predict.train(object=model_svm,testSet[,Topfivepred],type="raw")
my_data=data.frame(cbind(predicted=predictions,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Support Vector Machines with Radial Basis Function Kernel')
# Print, plot variable importance
print(varImp(model_svm, scale = FALSE))
ROC curve variable importance
Importance
duration 0.8078
housingyes 0.6093
contactunknown 0.6062
poutcomesuccess 0.5854
monthmar 0.5185
plot(varImp(model_svm, scale = FALSE), main="Support Vector Machines with Radial Basis Function Kernel")
confusionMatrix(testSet$y,predictions)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 11657 319
1 996 590
Accuracy : 0.903
95% CI : (0.8979, 0.908)
No Information Rate : 0.933
P-Value [Acc > NIR] : 1
Kappa : 0.4239
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9213
Specificity : 0.6491
Pos Pred Value : 0.9734
Neg Pred Value : 0.3720
Prevalence : 0.9330
Detection Rate : 0.8595
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7852
'Positive' Class : 0
predictions%>%head()
[1] 0 0 0 0 0 0
Levels: 0 1
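`predict()` returns hard class labels here. To get class probabilities instead (`type = "prob"`), the model would have to be retrained with `classProbs = TRUE` in `trainControl()`, and caret additionally requires the outcome's factor levels to be valid R variable names — which "0" and "1" are not:

```r
# caret rejects "0"/"1" as class levels when classProbs = TRUE because they
# are not syntactically valid R names; relabelling the outcome
# (e.g. to "no"/"yes") avoids the error.
make.names(c("0", "1"))
# [1] "X0" "X1"
```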
Find the models supported by the caret package — over 200 at the last count. The full list of method names is returned by names(getModelInfo()).
names(getModelInfo())
[1] "ada" "AdaBag" "AdaBoost.M1" "adaboost"
[5] "amdai" "ANFIS" "avNNet" "awnb"
[9] "awtan" "bag" "bagEarth" "bagEarthGCV"
[13] "bagFDA" "bagFDAGCV" "bam" "bartMachine"
[17] "bayesglm" "binda" "blackboost" "blasso"
[21] "blassoAveraged" "bridge" "brnn" "BstLm"
[25] "bstSm" "bstTree" "C5.0" "C5.0Cost"
[29] "C5.0Rules" "C5.0Tree" "cforest" "chaid"
[33] "CSimca" "ctree" "ctree2" "cubist"
[37] "dda" "deepboost" "DENFIS" "dnn"
[41] "dwdLinear" "dwdPoly" "dwdRadial" "earth"
[45] "elm" "enet" "evtree" "extraTrees"
[49] "fda" "FH.GBML" "FIR.DM" "foba"
[53] "FRBCS.CHI" "FRBCS.W" "FS.HGD" "gam"
[57] "gamboost" "gamLoess" "gamSpline" "gaussprLinear"
[61] "gaussprPoly" "gaussprRadial" "gbm_h2o" "gbm"
[65] "gcvEarth" "GFS.FR.MOGUL" "GFS.GCCL" "GFS.LT.RS"
[69] "GFS.THRIFT" "glm.nb" "glm" "glmboost"
[73] "glmnet_h2o" "glmnet" "glmStepAIC" "gpls"
[77] "hda" "hdda" "hdrda" "HYFIS"
[81] "icr" "J48" "JRip" "kernelpls"
[85] "kknn" "knn" "krlsPoly" "krlsRadial"
[89] "lars" "lars2" "lasso" "lda"
[93] "lda2" "leapBackward" "leapForward" "leapSeq"
[97] "Linda" "lm" "lmStepAIC" "LMT"
[101] "loclda" "logicBag" "LogitBoost" "logreg"
[105] "lssvmLinear" "lssvmPoly" "lssvmRadial" "lvq"
[109] "M5" "M5Rules" "manb" "mda"
[113] "Mlda" "mlp" "mlpML" "mlpSGD"
[117] "mlpWeightDecay" "mlpWeightDecayML" "monmlp" "msaenet"
[121] "multinom" "naive_bayes" "nb" "nbDiscrete"
[125] "nbSearch" "neuralnet" "nnet" "nnls"
[129] "nodeHarvest" "oblique.tree" "OneR" "ordinalNet"
[133] "ORFlog" "ORFpls" "ORFridge" "ORFsvm"
[137] "ownn" "pam" "parRF" "PART"
[141] "partDSA" "pcaNNet" "pcr" "pda"
[145] "pda2" "penalized" "PenalizedLDA" "plr"
[149] "pls" "plsRglm" "polr" "ppr"
[153] "PRIM" "protoclass" "pythonKnnReg" "qda"
[157] "QdaCov" "qrf" "qrnn" "randomGLM"
[161] "ranger" "rbf" "rbfDDA" "Rborist"
[165] "rda" "regLogistic" "relaxo" "rf"
[169] "rFerns" "RFlda" "rfRules" "ridge"
[173] "rlda" "rlm" "rmda" "rocc"
[177] "rotationForest" "rotationForestCp" "rpart" "rpart1SE"
[181] "rpart2" "rpartCost" "rpartScore" "rqlasso"
[185] "rqnc" "RRF" "RRFglobal" "rrlda"
[189] "RSimca" "rvmLinear" "rvmPoly" "rvmRadial"
[193] "SBC" "sda" "sdwd" "simpls"
[197] "SLAVE" "slda" "smda" "snn"
[201] "sparseLDA" "spikeslab" "spls" "stepLDA"
[205] "stepQDA" "superpc" "svmBoundrangeString" "svmExpoString"
[209] "svmLinear" "svmLinear2" "svmLinear3" "svmLinearWeights"
[213] "svmLinearWeights2" "svmPoly" "svmRadial" "svmRadialCost"
[217] "svmRadialSigma" "svmRadialWeights" "svmSpectrumString" "tan"
[221] "tanSearch" "treebag" "vbmpRadial" "vglmAdjCat"
[225] "vglmContRatio" "vglmCumulative" "widekernelpls" "WM"
[229] "wsrf" "xgbLinear" "xgbTree" "xyf"
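Each entry above is a method string for `train()`. Before building a `tuneGrid` for one of them, `modelLookup()` reports which parameters that method tunes (a sketch, assuming caret is installed):

```r
library(caret)

# Which parameters does each method tune?
modelLookup("svmRadial")  # sigma, C
modelLookup("rf")         # mtry
```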
#===================================================================================
#gradient boosted trees
# parameter tuning
#===================================================================================
fitControl <- trainControl(method = "repeatedcv",number = 5,repeats = 5,
allowParallel = TRUE )
#fitControl <- trainControl(
# method = "repeatedcv",
# number = 5,classProbs = TRUE,
# repeats = 5,allowParallel = TRUE)
#Creating grid
grid <- expand.grid(n.trees=c(10,20,50,100,500,1000),shrinkage=c(0.01,0.05,0.1,0.5),n.minobsinnode = c(3,5,10),interaction.depth=c(1,5,10))
# training the model
model_gbm<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method='gbm',trControl=fitControl,tuneGrid=grid)
# summarizing the model
print(model_gbm)
modelLookup(model='gbm')
#model_gbm%>%as.list.data.frame()%>%kable()
save(model_gbm,file="model_gbm.RData")
load("model_gbm.RData")
model_gbm$bestTune%>%kable()
| | n.trees | interaction.depth | shrinkage | n.minobsinnode |
|---|---|---|---|---|
| 54 | 1000 | 10 | 0.01 | 10 |
model_gbm$results%>%head()%>%kable()
| | shrinkage | interaction.depth | n.minobsinnode | n.trees | Accuracy | Kappa | AccuracySD | KappaSD |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.01 | 1 | 3 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
| 7 | 0.01 | 1 | 5 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
| 13 | 0.01 | 1 | 10 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
| 55 | 0.05 | 1 | 3 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
| 61 | 0.05 | 1 | 5 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
| 67 | 0.05 | 1 | 10 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
# various ways of finding the row with the maximum accuracy
model_gbm$results[which.max(model_gbm$results$Accuracy),]
model_gbm$results%>%dplyr::summarise(max1=max(Accuracy))
model_gbm$results%>%dplyr::slice(which.max(Accuracy))
model_gbm$results[which(model_gbm$results$Accuracy==max(model_gbm$results$Accuracy)),]
plot(model_gbm)
#
# pred<-predict(model_gbm,iris_test)
#
# Conf_matrix<-confusionMatrix(pred,iris[1:5,5])
#
# kable(Conf_matrix$table)
# using tuneLength
fitControl <- trainControl(method = "repeatedcv",number = 5,repeats = 5,
allowParallel = TRUE )
model_gbm2<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method='gbm',
trControl=fitControl,tuneLength=3)
Instead of supplying explicit values for every parameter, tuneLength tells caret how many candidate values to try per tuning parameter; caret then builds the grid itself.
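For gbm, caret's default grid under tuneLength = 3 varies n.trees and interaction.depth while holding shrinkage and n.minobsinnode fixed — the 3 × 3 grid visible in the output below. A base-R reconstruction of that grid:

```r
# Reconstruct caret's default gbm tuning grid for tuneLength = 3:
# n.trees = 50, 100, 150 and interaction.depth = 1, 2, 3, with
# shrinkage = 0.1 and n.minobsinnode = 10 held constant.
len  <- 3
grid <- expand.grid(interaction.depth = seq_len(len),
                    n.trees = (1:len) * 50,
                    shrinkage = 0.1,
                    n.minobsinnode = 10)
nrow(grid)  # 9 candidate models
```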
# model_gbm3<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method='gbm',
# trControl=fitControl,interaction.depth=10,n.trees=100,n.minobsinnode=10)
# print(model_gbm)
save(model_gbm2,file="model_gbm2.RData")
load("model_gbm2.RData")
print(model_gbm2)
Stochastic Gradient Boosting
31649 samples
5 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 25319, 25319, 25319, 25319, 25320, 25319, ...
Resampling results across tuning parameters:
interaction.depth n.trees Accuracy Kappa
1 50 0.8964577 0.2749744
1 100 0.9012330 0.4017715
1 150 0.9022024 0.4202335
2 50 0.9018471 0.4158550
2 100 0.9023276 0.4321845
2 150 0.9025777 0.4319439
3 50 0.9022814 0.4301685
3 100 0.9021306 0.4317884
3 150 0.9020157 0.4264921
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode'
was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150, interaction.depth = 2, shrinkage = 0.1
and n.minobsinnode = 10.
plot(model_gbm2)
varImp(object=model_gbm2)
gbm variable importance
Overall
duration 100.000
poutcomesuccess 40.234
contactunknown 2.495
housingyes 2.389
monthmar 0.000
# Predict using the test data
pred<-predict(model_gbm2,testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Stochastic Gradient Boosting Machine')
# Print, plot variable importance
print(varImp(model_gbm2, scale = FALSE))
gbm variable importance
Overall
duration 1576.5
poutcomesuccess 707.1
contactunknown 158.2
housingyes 156.6
monthmar 121.9
plot(varImp(model_gbm2, scale = FALSE), main="Stochastic Gradient Boosting")
confusionMatrix(testSet$y,pred)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 11976 0
1 1586 0
Accuracy : 0.8831
95% CI : (0.8775, 0.8884)
No Information Rate : 1
P-Value [Acc > NIR] : 1
Kappa : 0
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8831
Specificity : NA
Pos Pred Value : NA
Neg Pred Value : NA
Prevalence : 1.0000
Detection Rate : 0.8831
Detection Prevalence : 0.8831
Balanced Accuracy : NA
'Positive' Class : 0
The maximum accuracy of 0.9059 occurs at the parameter combination shrinkage = 0.01, interaction.depth = 10, n.minobsinnode = 10 and n.trees = 1000. The new model will be fitted with these parameter values.
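To fit that single combination without re-running the search, pass `train()` a one-row tuneGrid (a sketch; the `train()` call is commented out because it needs the trainSet objects defined earlier):

```r
# A one-row grid means no search: train() simply fits this combination.
best_grid <- expand.grid(n.trees = 1000, interaction.depth = 10,
                         shrinkage = 0.01, n.minobsinnode = 10)
# model_gbm_best <- train(trainSet[, Topfivepred], trainSet[, outcomeName],
#                         method = "gbm", trControl = fitControl,
#                         tuneGrid = best_grid)
```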
Neural Networks
model_nnet<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method="nnet",trControl=fitControl)
save(model_nnet,file="model_nnet.RData")
load("model_nnet.RData")
print(model_nnet)
Neural Network
31649 samples
5 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 25319, 25319, 25319, 25320, 25319, 25320, ...
Resampling results across tuning parameters:
size decay Accuracy Kappa
1 0e+00 0.8888624 0.1414151
1 1e-04 0.8901819 0.1786869
1 1e-01 0.8933553 0.2594665
3 0e+00 0.8977597 0.3529213
3 1e-04 0.8971071 0.3353283
3 1e-01 0.9027078 0.4248171
5 0e+00 0.9004840 0.3994720
5 1e-04 0.9004154 0.3905834
5 1e-01 0.9027372 0.4255974
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 5 and decay = 0.1.
plot(model_nnet)
varImp(object=model_nnet)
nnet variable importance
Overall
contactunknown 100.00
poutcomesuccess 88.71
duration 63.52
housingyes 17.35
monthmar 0.00
# Predict using the test data
pred<-predict(model_nnet,testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Neural Network')
# Print, plot variable importance
print(varImp(model_nnet, scale = FALSE))
nnet variable importance
Overall
contactunknown 30.485
poutcomesuccess 27.916
duration 22.185
housingyes 11.681
monthmar 7.732
plot(varImp(model_nnet, scale = FALSE), main="Neural Network")
confusionMatrix(testSet$y,pred)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 11654 322
1 991 595
Accuracy : 0.9032
95% CI : (0.8981, 0.9081)
No Information Rate : 0.9324
P-Value [Acc > NIR] : 1
Kappa : 0.4263
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9216
Specificity : 0.6489
Pos Pred Value : 0.9731
Neg Pred Value : 0.3752
Prevalence : 0.9324
Detection Rate : 0.8593
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7852
'Positive' Class : 0
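Collecting the test-set accuracies reported in the confusion matrices above makes the comparison explicit (numbers copied from the outputs; a more rigorous comparison would use caret's resamples() on the cross-validation results):

```r
# Test-set accuracies reported above, sorted best-first.
acc <- c(LogitBoost = 0.8384, xgbTree = 0.8825, gbm = 0.8831,
         svmRadial = 0.9030, nnet = 0.9032, rf = 0.9046)
sort(acc, decreasing = TRUE)
```

The random forest edges out the rest, with the neural net and SVM close behind on both accuracy and Kappa, while the boosted logistic model trails badly on the minority class.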
stopImplicitCluster()